Introduction

This graphical summary has the following sections:

  • Preprocessing
  • Topic modeling
  • Document clustering across the topic hierarchy
  • Topic hierarchy visualization

Preprocessing

Steps

  • Full texts were collected into a local Zotero library using the Zotero Connector.
  • Text was extracted using PyPDF2.
  • Punctuation was removed using nltk and regular expressions.
  • Texts were tokenized using nltk.
  • Tokens that occurred in a document fewer than 3 or more than 950 times were removed, as suggested in Khodorchenko, M. et al. (2020).
  • Additionally, tokens in which the same two-character combination accounted for more than 50% of the token length were removed.
    • Manual inspection of the pre-processed dataset showed that this step helps to reduce the number of uninformative tokens
      (see also Figures 1 and 2).
  • Stop words were removed using the nltk stop word list, which was extended to reduce the number of uninformative terms.
  • Co-occurrence counts were calculated for both datasets using a custom Python script with a window size of 10 tokens.
  • The corpus and co-occurrence counts were saved in a format accepted by the BigARTM library and used to construct hierarchical topic models.
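
The co-occurrence step can be illustrated with a minimal sketch. The custom script itself is not shown in this summary, so the function name, the exact window definition (a token is paired with the next window - 1 tokens) and the decision to skip self-pairs are all assumptions made for illustration only.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=10):
    """Count unordered token pairs that fall inside one sliding window."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair the token at position i with the next window - 1 tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:  # assumption: self-pairs are not counted
                counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy document (not taken from either dataset), small window for readability.
doc = ["reads", "genome", "alignment", "reads", "coverage"]
counts = cooccurrence_counts(doc, window=3)
```

Sorting each pair before counting makes the counts symmetric, so ("genome", "reads") and ("reads", "genome") share one entry.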

Comments on grid plot structure

  • The grid plots show the number of tokens by the maximum observed fraction
    of token length that consisted of the same two-character combination.
  • Each row shows the result of splitting a dataset at a given fraction-of-length threshold.
  • The first column shows the count distribution for tokens above the threshold (termed Noise here).
  • The second column shows the count distribution for tokens below the threshold (termed Clean here).
  • The third column shows the wordcloud plot for the "Noise" tokens.
  • The fourth column shows the wordcloud plot for the "Clean" tokens.
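
The split into "Noise" and "Clean" tokens can be sketched as follows. The exact scoring rule is not given in this summary, so this is one possible reading: the function names are hypothetical, occurrences are counted without overlap, and a two-character combination is only scored when it occurs at least twice in the token (an assumption, so that short ordinary tokens are not flagged).

```python
def max_dimer_fraction(token):
    """Largest share of the token length covered by one repeated dimer."""
    best = 0.0
    for dimer in {token[i : i + 2] for i in range(len(token) - 1)}:
        n = token.count(dimer)  # non-overlapping occurrences
        if n >= 2:  # assumption: a dimer must repeat to be scored
            best = max(best, 2 * n / len(token))
    return best


def split_tokens(tokens, threshold=0.5):
    """Split tokens into (noise, clean) at the fraction-of-length threshold."""
    noise = [t for t in tokens if max_dimer_fraction(t) > threshold]
    clean = [t for t in tokens if max_dimer_fraction(t) <= threshold]
    return noise, clean


# Toy tokens, not taken from the datasets.
noise, clean = split_tokens(["genome", "atatatat", "cacag", "dna"])
```

Varying `threshold` over several values would give one row of the grid plot per value.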

Figure 1: Repeated dimer filtering - natural language processing dataset

Figure 2: Repeated dimer filtering - bioinformatics dataset

Topic modeling

Document clustering across the topic hierarchy

  • The hierarchical topic model based on Chirkova, N.A., 2016 makes it possible to calculate a topic distribution for each document in the corpus.
  • Such vectors represent discrete probability distributions and can be used to represent the documents at a given level of the topic hierarchy, in a way similar to neural network embeddings.
  • Additionally, for each level of the topic hierarchy (except the first one), it is possible to get vectors representing super-topics in terms of sub-topics. This makes it possible to treat super-topics (at the higher level of the hierarchy) as pseudo-documents and include them in the document matrix. This is the basis of the approach for computing the topic hierarchy as described by Chirkova, N.A., 2016 and implemented in BigARTM.
  • Calculating the Hellinger distance between the described vectors (documents and pseudo-documents) was suggested as one of the quality metrics for the model in the original publication.
  • In the current work, an attempt was made to extend this approach by implementing document similarity calculation using three additional steps:
    • First, given a matrix of topic-based document distributions (termed Phi) and a matrix of pseudo-document distributions (termed Psi), a combined matrix (termed Phi_Psi) is generated.
    • Next, the square pairwise distance matrix is calculated from Phi_Psi using the Hellinger distance formula.
    • Finally, the distance matrix is converted into a similarity matrix using the Bhattacharyya coefficient, as discussed in Kitsos, C.P. and Nisiotis, C.-S. (2022).
  • Using the resulting similarity matrix, it was possible to perform spectral clustering of documents with the scikit-learn package, assigning groups of topic-based similarity within each level of the topic hierarchy.
  • It was also possible to represent documents at each level of the hierarchy visually in two-dimensional space, using multidimensional scaling on the distance matrix calculated from Phi_Psi to generate a scatter plot.
  • This plot was additionally annotated with document indices, and connections between documents were drawn where the pairwise Hellinger distance was below a specified threshold. The resulting plots are shown in Figures 4 and 7.
  • The tables in Figures 4 and 7 show the correspondence between the maximum-probability topic index and the cluster assigned by the described approach.
    • Sankey plots show the discrepancies between spectral clustering results and the topic labels with the highest probabilities at a given layer.
  • An additional Sankey plot visualizes the correspondence between cluster ids assigned at different levels of the topic hierarchy, indicating the degree of connectivity between clusters at those levels.
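
The three steps and the clustering call can be sketched as below. This is a minimal illustration, not the code used for the reported results: the matrices are random toy data, rows are assumed to be documents or pseudo-documents and columns topics, and the helper name `hellinger_matrix` is hypothetical. It relies on the identity BC(p, q) = 1 - H(p, q)^2, where H is the Hellinger distance and BC the Bhattacharyya coefficient. Whether pseudo-document rows were kept in the clustering input is not stated here; the sketch clusters all rows.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def hellinger_matrix(rows):
    """Pairwise Hellinger distances between rows of a row-stochastic matrix."""
    root = np.sqrt(rows)
    diff = root[:, None, :] - root[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) / 2.0)


# Toy data (not from the report): 12 documents, 2 pseudo-documents, 4 topics.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(4), size=12)
psi = rng.dirichlet(np.ones(4), size=2)

phi_psi = np.vstack([phi, psi])            # step 1: combined matrix
dist = hellinger_matrix(phi_psi)           # step 2: square distance matrix
sim = np.clip(1.0 - dist ** 2, 0.0, 1.0)   # step 3: Bhattacharyya coefficient

# A precomputed affinity lets scikit-learn use the similarity matrix directly.
model = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = model.fit_predict(sim)
```

The clipping only guards against small floating-point excursions outside [0, 1]; it does not change the similarity values otherwise.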

Topic hierarchy visualization

  • An additional function was developed to represent all levels of the topic hierarchy and include any connections between layers whose model-assigned probability is above a specified threshold. This shows the resulting topic hierarchy and the connections discovered by the model.
  • The resulting plots are shown in Figures 5 and 8.
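
Selecting the connections to draw can be sketched as a simple threshold on the matrix that links two adjacent levels. The function below is hypothetical and is not the plotting function mentioned above; it assumes rows index sub-topics and columns index super-topics, and the toy values and the threshold of 0.1 are illustrative only.

```python
import numpy as np


def hierarchy_edges(psi, threshold=0.1):
    """Return (parent, child, probability) for links above the threshold."""
    children, parents = np.nonzero(psi > threshold)
    return [(int(p), int(c), float(psi[c, p])) for c, p in zip(children, parents)]


# Toy matrix: 4 sub-topics (rows) linked to 2 super-topics (columns).
psi = np.array([
    [0.70, 0.00],
    [0.30, 0.05],
    [0.00, 0.55],
    [0.00, 0.40],
])
edges = hierarchy_edges(psi, threshold=0.1)
```

Raising the threshold removes weak links and yields a sparser drawing of the hierarchy.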

Main conclusions

  • Hierarchical topic modeling with the BigARTM library makes it possible to construct an interpretable topic hierarchy using topic coherence as the target metric, which confirms the results of Khodorchenko, M. et al. (2020) and Chirkova, N.A., 2016.
  • The additive regularization approach makes it possible to control the parameters of the model, including:
    • Topic sparsity - resulting in a set of topics, each represented by a limited set of application-area-specific tokens.
    • Topic distinctiveness - resulting in uncorrelated token distributions between topics.
    • Sparsity of connections between hierarchy levels - making it possible to adjust the degree to which the model expects a super-topic to cover a small set of sub-topics versus a broad range of sub-topics.
  • For the data at hand, the proposed spectral clustering approach in most cases discovers groups that correspond to the most probable topic assigned by the topic model at a given level of the hierarchy (see the Sankey plots for clustering results and top-P topic id in Figures 4 and 7).
  • This is an expected result, since topic modeling can be viewed as a form of soft clustering. However, the correspondence was not 100%, most likely because of cases where the topic model assigns similar probabilities to multiple topics for the same document.
  • The spectral clustering approach is expected to convert the soft clustering output of the topic model into hard clustering results more accurately than taking only the most probable topic assigned by the model, since the similarity used for spectral clustering is calculated from the entire vector of topic probabilities rather than from its maximum value.
  • The resulting conversion makes it possible to construct a Sankey diagram showing the connectivity between document clusters at different levels of the topic hierarchy. This visualizes the structure of the hierarchy along with the degree of connectivity between clusters at different levels.
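
The correspondence between cluster ids and most probable topic ids can be summarized as link weights for a Sankey diagram. The sketch below uses toy assignments rather than the reported tables, and the `agreement` score (share of documents that fall into the dominant topic of their cluster) is an assumed summary measure that does not appear in the analysis above.

```python
import pandas as pd

# Toy assignments (not from the report): one row per document.
assignments = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 2],
    "max_p_topic_id": [2, 2, 3, 0, 0, 1],
})

# One row per (source, target) pair with the number of shared documents.
links = (
    assignments.groupby(["cluster_id", "max_p_topic_id"])
    .size()
    .reset_index(name="n_docs")
)

# Cluster ids and topic ids are not aligned, so compare via the dominant topic.
agreement = links.groupby("cluster_id")["n_docs"].max().sum() / len(assignments)
```

The same grouping applied to two cluster id columns from different levels gives the link weights for the between-level Sankey diagram.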

Future directions

  • It would be interesting to try optimizing the parameters of the model, specifically:
    • Regularization coefficients.
    • Number of training iterations at each training session.
    • Number of topics at each level of the hierarchy.
  • Regularization coefficients and training iterations could be optimized by maximizing a coherence-based target function (perhaps in combination with other quality metrics), as suggested in Khodorchenko, M. et al. (2020).
  • The number of topics at each level of the hierarchy could be optimized by minimizing Renyi entropy as the target function, as suggested in Kitsos, C.P. and Nisiotis, C.-S. (2022).
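
For orientation, the generic Renyi entropy of a discrete distribution can be computed as below. This is only the textbook definition, H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha); how a fitted topic model would be mapped to the distribution p, and which alpha to use, are not specified in this summary and are left open here.

```python
import numpy as np


def renyi_entropy(p, alpha=2.0):
    """Renyi entropy (natural log) of a discrete probability distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes do not contribute
    if np.isclose(alpha, 1.0):
        return float(-(p * np.log(p)).sum())  # Shannon entropy as the limit
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))


uniform = renyi_entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
peaked = renyi_entropy([0.97, 0.01, 0.01, 0.01])   # closer to zero
```

A candidate number of topics would then be scored by evaluating such a function on a distribution derived from each fitted model and keeping the minimum.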

Results for BIOIT set

CPU times: user 20.2 s, sys: 2.98 s, total: 23.2 s
Wall time: 14.4 s
level0

topic_0:  ['data', 'sequencing', 'analysis', 'cells', 'cell', 'dna', 'cancer', 'methods', 'used', 'gene']
topic_1:  ['reads', 'genome', 'read', 'data', 'alignment', 'reference', 'variant', 'sequencing', 'genomes', 'coverage']

level1

topic_0:  ['sequencing', 'dna', 'cancer', 'resistance', 'gene', 'ngs', 'genes', 'using', 'detection', 'protein']
topic_1:  ['variant', 'kraken', 'variants', 'regions', 'normalization', 'benchmark', 'species', 'scone', 'snps', 'wgs']
topic_2:  ['reads', 'genome', 'read', 'alignment', 'assembly', 'coverage', 'genomes', 'reference', 'contigs', 'graph']
topic_3:  ['data', 'analysis', 'cell', 'cells', 'methods', 'metagenomic', 'used', 'expression', 'nat', 'metagenomics']

level2

topic_0:  ['alignment', 'bioinformatics', 'tools', 'algorithms', 'umap', 'mapping', 'short', 'length', 'algorithm', 'fuzzy']
topic_1:  ['genomes', 'contigs', 'lineage', 'contig', 'tree', 'assemblies', 'supplementary', 'assigned', 'samples', 'grapetree']
topic_2:  ['cell', 'cells', 'methods', 'expression', 'clustering', 'number', 'dataset', 'model', 'clusters', 'scvis']
topic_3:  ['variant', 'variants', 'ngs', 'wgs', 'normalization', 'scone', 'calling', 'depth', 'performance', 'lrs']
topic_4:  ['mash', 'sketch', 'aligner', 'fastp', 'hash', 'mappings', 'quality', 'size', 'mapq', 'file']
topic_5:  ['species', 'learning', 'tumor', 'detection', 'deep', 'genomics', 'liquid', 'circulating', 'pubmed', 'patients']
topic_6:  ['metagenomic', 'metagenomics', 'microbiome', 'args', 'benchmark', 'regions', 'resistance', 'microbial', 'usa', 'pubmed']
topic_7:  ['assembly', 'coverage', 'graph', 'set', 'illumina', 'distance', 'bias', 'forensic', 'ajb', 'ion']

level3

topic_0:  ['variant', 'read', 'genome', 'aligner', 'resfinder', 'mappings', 'resistance', 'mapq', 'reference', 'reads']
topic_1:  ['genomes', 'assembly', 'reads', 'using', 'genome', 'mash', 'benchmark', 'regions', 'variants', 'coverage']
topic_2:  ['data', 'clustering', 'cell', 'cells', 'normalization', 'genes', 'methods', 'expression', 'number', 'used']
topic_3:  ['data', 'coverage', 'learning', 'genome', 'deep', 'bias', 'sequencing', 'genomics', 'illumina', 'human']
topic_4:  ['read', 'alignment', 'reads', 'genome', 'reference', 'sequencing', 'algorithms', 'bioinformatics', 'dna', 'scone']
topic_5:  ['sequencing', 'species', 'data', 'wgs', 'reads', 'lrs', 'variant', 'depth', 'using', 'kraken']
topic_6:  ['umap', 'data', 'usa', 'university', 'fuzzy', 'author', 'set', 'qiime', 'manuscript', 'manifold']
topic_7:  ['analysis', 'data', 'cell', 'cells', 'methods', 'sequencing', 'gatk', 'expression', 'gene', 'genome']
topic_8:  ['snps', 'drosophila', 'variant', 'amino', 'megares', 'gene', 'snpeff', 'variants', 'args', 'protein']
topic_9:  ['cells', 'data', 'args', 'resistance', 'scvis', 'resistome', 'cell', 'dataset', 'bipolar', 'clusters']
topic_10:  ['cancer', 'dna', 'data', 'sequencing', 'analysis', 'tumor', 'detection', 'pubmed', 'liquid', 'circulating']
topic_11:  ['reads', 'graph', 'assembly', 'ajb', 'contigs', 'bruijn', 'distance', 'spades', 'genome', 'edge']
topic_12:  ['metagenomic', 'analysis', 'data', 'sequencing', 'metagenomics', 'microbiome', 'used', 'microbial', 'dna', 'pubmed']
topic_13:  ['data', 'nat', 'methods', 'integration', 'analysis', 'cell', 'reduction', 'sequencing', 'cells', 'joint']
topic_14:  ['sequencing', 'dna', 'ngs', 'cancer', 'analysis', 'data', 'genome', 'forensic', 'technology', 'variant']
topic_15:  ['kraken', 'reference', 'data', 'sequences', 'genes', 'used', 'lineage', 'sequence', 'genomes', 'genome']

Figure 3. Quality metrics across training iterations for hierarchical model - bioinformatics dataset

Results for level 0

Sparsity Phi: 0.381 
Sparsity Theta: 0.000
Kernel contrast: 0.891
Kernel purity: 0.938
Results for level 1

Sparsity Phi: 0.567 
Sparsity Theta: 0.000
Kernel contrast: 0.846
Kernel purity: 0.890
Results for level 2

Sparsity Phi: 0.735 
Sparsity Theta: 0.009
Kernel contrast: 0.839
Kernel purity: 0.879
Results for level 3

Sparsity Phi: 0.000 
Sparsity Theta: 0.000
Kernel contrast: 0.461
Kernel purity: 0.388

Figure 4. Spectral clustering results - bioinformatics dataset

Level 0: cluster id vs. most probable topic id
index doc_names cluster_id max_p_topic_id
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 0 0
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 0 0
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 0 0
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 0 0
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 0 0
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 0 0
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 0
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 0 0
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 0 0
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 0 0
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 0 0
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 0 0
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 0 0
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 0 0
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 0 0
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 0 0
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 0 0
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 0 0
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 0 0
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 0 0
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 0 0
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 0 0
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 0 0
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 0 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 0 0
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 0 0
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 0 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 1 1
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 1 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 1 1
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 1 1
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 1 1
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 1 1
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 1 1
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 1 1
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 1 1
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 1 1
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 1 1
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 1 1
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 1 1
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 1 1
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 1 1
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 1 1
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 1
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 1 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 1 1
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 1 1
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 1 1
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 1 1

Level 1: cluster id vs. most probable topic id
index doc_names cluster_id max_p_topic_id
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 0 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 0 2
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 0 2
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 0 2
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 0 2
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 2
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 0 2
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 0 2
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 0 2
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 0 2
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 0 2
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 0 2
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 0 2
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 0 2
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 0 3
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 0 3
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 1 0
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 1 0
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 1 0
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 1 0
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 1 0
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 1 0
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 1 0
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 1 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 1 0
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 1 0
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 2 1
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 2 1
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 2 1
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 2 1
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 2 1
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 2 1
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 2 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 2 1
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 2 1
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 2 1
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 2 1
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 3 3
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 3 3
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 3 3
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 3 3
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 3 3
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 3 3
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 3 3
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 3 3
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 3 3
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 3 3
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 3 3
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 3 3

Level 2: cluster id vs. most probable topic id
index doc_names cluster_id max_p_topic_id
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 4
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 0 4
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 0 4
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 0 4
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 1 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 1 1
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 1 1
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 1 1
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 1 1
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 1 1
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 1
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 1 1
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 1 6
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 2 0
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 2 0
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 2 0
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 2 0
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 2 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 3 3
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 3 5
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 3 5
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 3 5
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 3 5
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 4 3
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 4 3
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 4 3
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 4 3
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 5 6
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 5 7
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 5 7
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 5 7
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 5 7
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 5 7
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 5 7
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 6 2
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 6 2
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 6 2
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 6 2
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 6 2
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 6 2
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 6 2
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 7 6
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 7 6
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 7 6
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 7 6
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 7 6
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 7 6
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 7 6
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 7 6

Level 3: cluster id vs. most probable topic id
index doc_names cluster_id max_p_topic_id
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 0 8
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 0 8
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 1 11
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 1 11
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 1 11
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 11
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 2 6
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 2 6
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 3 0
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 3 12
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 3 12
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 3 12
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 4 14
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 4 14
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 4 14
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 5 15
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 5 15
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 5 15
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 5 15
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 6 13
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 7 7
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 7 7
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 7 7
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 8 2
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 8 2
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 8 2
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 9 5
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 9 5
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 9 5
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 9 5
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 10 10
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 10 10
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 10 10
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 10 10
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 10 10
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 11 3
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 11 3
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 11 3
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 12 9
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 12 9
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 13 0
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 13 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 14 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 14 1
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 14 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 14 1
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 15 4
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 15 4
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 15 4

Cluster ids assigned at levels 0 to 3
index doc_names cluster_id_0 cluster_id_1 cluster_id_2 cluster_id_3
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 1 0 1 14
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 1 0 2 15
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 0 0 5 1
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 0 3 6 11
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 0 0 5 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 1 0 1 14
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 0 3 7 2
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 1 2 7 0
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 0 1 3 10
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 0 1 3 11
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 0 0 10
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 1 2 7 0
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 1 2 4 8
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 1 1 2 9
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 0 3 6 12
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 0 0 1 7
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 0 1 1 5
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 0 1 7 13
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 0 3 6 6
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 0 3 1 10
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 0 0 5 1
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 1 2 4 9
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 1 2 4 4
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 1 0 2 15
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 0 3 6 8
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 0 3 5 3
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 1 2 3 9
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 0 3 2 2
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 1 0 6 7
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 1 0 1 5
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 0 3 7 3
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 0 3 7 3
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 0 3 6 7
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 0 1 4 9
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 1 0 0 14
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 1 1 7 12
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 0 2 1 5
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 1 0 2 15
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 0 1 3 10
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 0 1 1
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 1 0 5 11
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 0 0 5 3
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 0 1 3 4
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 1 2 7 14
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 1 2 0 13
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 1 2 0 5
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 1 1 5 4
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 0 3 6 8
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 0 2 1 10

Figure 5. Topic hierarchy structure - bioinformatics dataset

Results for NLP set

CPU times: user 1min, sys: 15.8 s, total: 1min 16s
Wall time: 27.8 s
level0

topic_0:  ['data', 'network', 'social', 'used', 'learning', 'spam', 'information', 'embedding', 'networks', 'research']
topic_1:  ['model', 'proceedings', 'conference', 'information', 'learning', 'knowledge', 'text', 'language', 'data', 'methods']

level1

topic_0:  ['data', 'network', 'embedding', 'information', 'news', 'graph', 'clustering', 'learning', 'models', 'networks']
topic_1:  ['clinical', 'social', 'patent', 'text', 'model', 'classification', 'data', 'spam', 'learning', 'detection']
topic_2:  ['knowledge', 'model', 'articles', 'tax', 'used', 'training', 'disease', 'set', 'research', 'quality']
topic_3:  ['proceedings', 'conference', 'language', 'extraction', 'computational', 'knowledge', 'association', 'linguistics', 'learning', 'methods']

level2

topic_0:  ['clustering', 'areas', 'topic', 'institutions', 'recommendation', 'terms', 'thematic', 'technology', 'performance', 'recommendations']
topic_1:  ['model', 'text', 'information', 'used', 'models', 'using', 'clinical', 'conference', 'classification', 'language']
topic_2:  ['event', 'sket', 'reports', 'pathology', 'engineering', 'events', 'argument', 'cancer', 'topics', 'archetype']
topic_3:  ['articles', 'example', 'dream', 'citation', 'training', 'article', 'prefiltering', 'traf', 'trial', 'sampling']
topic_4:  ['knowledge', 'proceedings', 'extraction', 'conference', 'computational', 'language', 'methods', 'entity', 'relation', 'concept']
topic_5:  ['social', 'spam', 'tax', 'detection', 'features', 'twitter', 'techniques', 'users', 'accounts', 'cases']
topic_6:  ['patent', 'questions', 'question', 'problem', 'patents', 'modeling', 'class', 'study', 'problems', 'classification']
topic_7:  ['network', 'networks', 'graph', 'embedding', 'nodes', 'node', 'disease', 'representation', 'gcn', 'drug']

level3

topic_0:  ['areas', 'institutions', 'recommendation', 'thematic', 'recommendations', 'system', 'set', 'collaboration', 'technology', 'institution']
topic_1:  ['example', 'dream', 'house', 'dreams', 'situation', 'reports', 'flying', 'situations', 'falling', 'groups']
topic_2:  ['political', 'model', 'text', 'classification', 'detection', 'work', 'seed', 'label', 'data', 'policy']
topic_3:  ['construction', 'research', 'data', 'text', 'argument', 'analysis', 'nlp', 'documents', 'media', 'mining']
topic_4:  ['proceedings', 'conference', 'extraction', 'learning', 'computational', 'information', 'language', 'association', 'word', 'methods']
topic_5:  ['data', 'news', 'clustering', 'model', 'set', 'methods', 'online', 'models', 'patent', 'problem']
topic_6:  ['data', 'clinical', 'clustering', 'trial', 'emr', 'patient', 'patients', 'vector', 'medical', 'trials']
topic_7:  ['patent', 'question', 'questions', 'word', 'words', 'summarization', 'based', 'model', 'information', 'data']
topic_8:  ['knowledge', 'entity', 'resolution', 'subjectivity', 'methods', 'tax', 'concept', 'entities', 'semantic', 'anaphora']
topic_9:  ['clinical', 'knowledge', 'concept', 'argumentative', 'literature', 'inform', 'mining', 'med', 'disease', 'learning']
topic_10:  ['model', 'articles', 'data', 'topic', 'training', 'citation', 'article', 'used', 'research', 'topics']
topic_11:  ['model', 'models', 'medical', 'clinical', 'bert', 'text', 'biomedical', 'classification', 'language', 'embeddings']
topic_12:  ['learning', 'network', 'embedding', 'graph', 'networks', 'node', 'nodes', 'data', 'information', 'representation']
topic_13:  ['event', 'social', 'lockdown', 'class', 'ratio', 'learning', 'data', 'media', 'events', 'distancing']
topic_14:  ['spam', 'social', 'detection', 'features', 'patent', 'classification', 'learning', 'dataset', 'used', 'text']
topic_15:  ['sket', 'reports', 'pathology', 'data', 'disease', 'cancer', 'network', 'concepts', 'networks', 'fication']

Figure 6. Quality metrics across training iterations for hierarchical model - bioinformatics dataset

Results for level 0

Sparsity Phi: 0.373
Sparsity Theta: 0.000
Kernel contrast: 0.871
Kernel purity: 0.914

Results for level 1

Sparsity Phi: 0.570
Sparsity Theta: 0.000
Kernel contrast: 0.827
Kernel purity: 0.780

Results for level 2

Sparsity Phi: 0.686
Sparsity Theta: 0.000
Kernel contrast: 0.816
Kernel purity: 0.787

Results for level 3

Sparsity Phi: 0.000
Sparsity Theta: 0.000
Kernel contrast: 0.471
Kernel purity: 0.402
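
For orientation, the four scores reported above can be sketched in plain Python as they are commonly defined. Sparsity is the share of near-zero entries in a matrix. The kernel of a topic is the set of tokens whose p(topic | token) reaches a threshold. Purity is the sum of p(token | topic) over the kernel, and contrast is the mean p(topic | token) over the kernel. This is a simplified illustration and not the BigARTM implementation: it assumes a uniform prior over topics, and the toy matrix, the function names and the 0.75 threshold are hypothetical.

```python
# Toy Phi matrix (hypothetical): rows are tokens, columns are topics,
# TOY_PHI[w][t] = p(token w | topic t); each column sums to 1.
TOY_PHI = [
    [0.6, 0.0],
    [0.3, 0.0],
    [0.1, 0.1],
    [0.0, 0.9],
]


def sparsity(matrix, eps=1e-12):
    """Fraction of entries that are numerically zero."""
    values = [v for row in matrix for v in row]
    return sum(v < eps for v in values) / len(values)


def kernel_scores(phi, threshold=0.75):
    """Return (mean kernel purity, mean kernel contrast) over topics.

    Assumption: topics are equally likely a priori, so p(t | w) is
    obtained by normalising each row of phi.
    """
    n_topics = len(phi[0])
    p_t_given_w = []
    for row in phi:
        total = sum(row)
        p_t_given_w.append([v / total if total > 0 else 0.0 for v in row])

    purities, contrasts = [], []
    for t in range(n_topics):
        # Kernel: tokens that are strongly associated with topic t.
        kernel = [w for w, row in enumerate(p_t_given_w) if row[t] >= threshold]
        if not kernel:
            purities.append(0.0)
            contrasts.append(0.0)
            continue
        purities.append(sum(phi[w][t] for w in kernel))
        contrasts.append(sum(p_t_given_w[w][t] for w in kernel) / len(kernel))
    return sum(purities) / n_topics, sum(contrasts) / n_topics


if __name__ == "__main__":
    print("Sparsity Phi:", sparsity(TOY_PHI))
    print("Kernel purity, contrast:", kernel_scores(TOY_PHI))
```

In this toy case the shared third token falls outside both kernels, so purity drops below 1 while contrast stays at 1. Lower contrast, as at level 3 above, would indicate that kernel tokens are shared among more topics.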

Figure 7. Spectral clustering results - NLP dataset

level0

doc_names cluster_id max_p_topic_id
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 0 1
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 0 1
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 0 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 0 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 0 1
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 0 1
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 0 1
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 0 1
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 0 1
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 0 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 0 1
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 0 1
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 0 1
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 0 1
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 0 1
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 0 1
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 0 1
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 0 1
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 0 1
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 0 1
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 0 1
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 0 1
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 0 1
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 0 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 0 1
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 1 0
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 1 0
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 1 0
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 1 0
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 1 0
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 1 0
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 1 0
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 1 0
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 1 0
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 1 0
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 1 0
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 1 0
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 1 0
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 1 0
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 1 0
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 1 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 1 0
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 1 0
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 1 0
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 1 0
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 1 0
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 1 0
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 1 0
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 1 0
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 1 0

level1

doc_names cluster_id max_p_topic_id
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 0 2
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 0 3
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 0 3
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 0 3
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 0 3
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 0 3
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 0 3
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 0 3
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 1 0
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 1 0
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 1 0
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 1 0
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 1 0
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 1 0
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 1 0
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 1 0
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 1 0
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 1 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 1 0
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 1 0
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 1 0
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 1 0
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 1 0
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 1 0
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 1 0
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 1 2
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 2 2
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 2 2
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 2 2
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 2 2
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 2 2
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 2 2
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 2 2
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 2 2
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 3 1
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 3 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 3 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 3 1
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 3 1
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 3 1
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 3 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 3 1
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 3 1
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 3 1
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 3 1
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 3 1
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 3 1
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 3 1
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 3 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 3 1
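
One way to read the four-cluster table above is to check how often a document's spectral cluster agrees with its most probable topic (max_p_topic_id). The sketch below tallies the (cluster_id, max_p_topic_id) pairs from that table and computes a simple purity-style agreement score. This score is an illustration added here and is not one of the metrics reported in this summary; the names PAIR_COUNTS and cluster_purity are hypothetical.

```python
from collections import Counter

# (cluster_id, max_p_topic_id) -> number of documents,
# counted from the four-cluster table above (50 documents in total).
PAIR_COUNTS = Counter({
    (0, 2): 1, (0, 3): 7,
    (1, 0): 17, (1, 2): 1,
    (2, 2): 8,
    (3, 1): 16,
})


def cluster_purity(pair_counts):
    """Share of documents whose dominant topic is the most common one in their cluster."""
    best_per_cluster = {}
    for (cluster, _topic), n in pair_counts.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), n)
    return sum(best_per_cluster.values()) / sum(pair_counts.values())


if __name__ == "__main__":
    print(cluster_purity(PAIR_COUNTS))  # 48 of 50 documents agree
```

Only two documents (Wang et al. 2021 in cluster 0 and Zangari et al. 2023 in cluster 1) have a dominant topic that differs from the majority topic of their cluster.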

level2

doc_names cluster_id max_p_topic_id
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 0 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 0 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 0 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 0 1
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 0 1
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 0 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 0 1
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 1 4
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 1 4
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 1 4
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 1 4
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 2 1
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 2 2
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 2 2
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 2 2
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 2 2
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 2 2
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 2 2
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 2 2
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 2 2
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 2 7
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 3 5
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 3 5
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 3 5
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 3 5
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 3 5
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 3 5
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 3 5
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 3 5
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 4 0
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 4 0
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 4 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 4 0
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 4 0
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 4 0
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 5 7
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 5 7
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 5 7
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 5 7
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 5 7
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 5 7
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 6 3
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 6 3
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 6 3
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 6 3
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 7 6
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 7 6
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 7 6
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 7 6
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 7 6

level3

doc_names cluster_id max_p_topic_id
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 0 7
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 0 7
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 0 7
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 0 7
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 0 7
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 1 0
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 2 2
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 2 2
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 3 5
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 3 5
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 3 5
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 3 5
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 4 1
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 5 6
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 5 6
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 5 6
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 5 6
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 6 15
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 6 15
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 6 15
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 7 3
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 7 3
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 7 3
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 7 3
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 8 10
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 8 10
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 8 10
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 8 10
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 8 10
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 9 9
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 9 9
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 9 9
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 10 4
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 10 14
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 11 8
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 11 8
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 11 8
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 12 13
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 12 13
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 12 13
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 13 12
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 13 12
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 13 12
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 13 12
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 14 14
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 14 14
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 15 11
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 15 11
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 15 11
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 15 11
doc_names cluster_id_0 cluster_id_1 cluster_id_2 cluster_id_3
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 1 0 3 9
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 1 1 5 13
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 0 1 5 15
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 1 0 0 7
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 1 1 7 3
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 1 1 4 3
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 0 1 2 8
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 1 3 4 0
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 0 0 1 10
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 1 3 6 5
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 1 1 3 13
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 0 3 0 2
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 0 3 0 9
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 1 2 4 8
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 1 2 5 6
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 1 0 3 7
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 0 3 7 3
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 1 0 6 4
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 0 0 2 12
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 0 1 5 15
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 1 1 5 5
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 1 1 2 0
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 1 1 2 0
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 1 3 3 12
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 1 1 4 5
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 0 2 4 1
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 0 3 7 12
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 0 3 0 15
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 0 1 2 8
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 1 3 2 7
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 0 3 2 6
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 0 0 1 11
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 0 3 2 6
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 0 1 2 5
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 0 2 6 8
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 0 1 0 10
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 0 2 7 0
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 1 3 3 13
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 1 1 2 0
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 0 2 4 9
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 1 3 3 14
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 0 3 7 14
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 1 2 3 11
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 0 1 1 3
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 0 0 1 11
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 1 3 3 7
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 0 1 0 15
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 1 1 5 13
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 0 3 0 2
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 1 2 6 8
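
The per-level cluster assignments in the table above can be turned into links between adjacent levels by counting how many documents each pair of clusters shares. The sketch below does this for the first six rows of the table. It is only an illustration under that assumption and does not claim to reproduce how the hierarchy figure was built; the names ROWS, edge_counts and dominant_parents are hypothetical.

```python
from collections import Counter

# First six rows of the table above:
# (cluster_id_0, cluster_id_1, cluster_id_2, cluster_id_3) per document.
ROWS = [
    (1, 0, 3, 9),
    (1, 1, 5, 13),
    (0, 1, 5, 15),
    (1, 0, 0, 7),
    (1, 1, 7, 3),
    (1, 1, 4, 3),
]


def edge_counts(rows, level):
    """Count documents shared by each (parent, child) cluster pair
    between `level` and `level + 1`."""
    return Counter((row[level], row[level + 1]) for row in rows)


def dominant_parents(rows, level):
    """Map each child cluster to the parent cluster it shares most documents with.

    Ties are resolved in favour of the pair encountered first.
    """
    best = {}
    for (parent, child), n in edge_counts(rows, level).items():
        if child not in best or n > best[child][1]:
            best[child] = (parent, n)
    return {child: parent for child, (parent, _) in sorted(best.items())}


if __name__ == "__main__":
    print(edge_counts(ROWS, 0))
    print(dominant_parents(ROWS, 0))
```

Because clusters are computed independently at each level, a child cluster can draw documents from more than one parent (document 2 above sits in level-0 cluster 0 but shares level-1 cluster 1 with documents from level-0 cluster 1), which is why the counts are kept alongside the dominant link.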

Figure 8. Topic hierarchy structure - NLP dataset